AITopics | shallow neural network

While significant theoretical progress has been achieved, unveiling the generalization mystery of overparameterized neural networks still remains largely elusive. In this paper, we study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability. We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds by balancing the optimization and generalization via early-stopping. As compared to existing analysis on GD, our new analysis requires a relaxed overparameterization assumption and also applies to SGD. The key for the improvement is a better estimation of the smallest eigenvalues of the Hessian matrices of the empirical risks and the loss function along the trajectories of GD and SGD by providing a refined estimation of their iterates.

gradient method, name change, stability and generalization analysis, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.56)

Add feedback

A Dynamical Central Limit Theorem for Shallow Neural Networks

Neural Information Processing SystemsDec-24-2025, 22:27:51 GMT

Recent theoretical work has characterized the dynamics and convergence properties for wide shallow neural networks trained via gradient descent; the asymptotic regime in which the number of parameters tends towards infinity has been dubbed the mean-field limit. At initialization, the randomly sampled parameters lead to a deviation from the mean-field limit that is dictated by the classical central limit theorem (CLT). However, the dynamics of training introduces correlations among the parameters raising the question of how the fluctuations evolve during training. Here, we analyze the mean-field dynamics as a Wasserstein gradient flow and prove that the deviations from the mean-field evolution scaled by the width, in the width-asymptotic limit, remain bounded throughout training. This observation has implications for both the approximation rate and the generalization: the upper bound we obtain is controlled by a Monte-Carlo type resampling error, which importantly does not depend on dimension. We also relate the bound on the fluctuations to the total variation norm of the measure to which the dynamics converges, which in turn controls the generalization error.

dynamical central limit theorem, name change, shallow neural network, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)

Add feedback

Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

Neural Information Processing SystemsDec-24-2025, 08:43:40 GMT

We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks with quadratic activation function in the overparametrized regime where the layer width m is larger than the input dimension d. We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width m*<=m. We describe how the empirical loss landscape is affected by the number n of data samples and the width m* of the teacher network. In particular we determine how the probability that there be no spurious minima on the empirical loss depends on n, d, and m*, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally we characterize the time-convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments.

optimization and generalization, quadratic activation function, shallow neural network, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

A Classification of G -invariant Shallow Neural Networks

Neural Information Processing SystemsDec-24-2025, 06:14:04 GMT

When trying to fit a deep neural network (DNN) to a $G$-invariant target function with $G$ a group, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of ``$G$-invariant neural architecture design'': What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or ``shallow'' neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$, and we prove a second theorem that characterizes the inclusion maps or ``network morphisms'' between the architectures that can be leveraged during neural architecture search (NAS). The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons; the classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. The $G$-SNN architectures corresponding to nontrivial cohomology classes have, to our knowledge, never been explicitly identified in the literature previously. Using a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. Finally, we prove that architectures corresponding to inequivalent cohomology classes coincide in function space only when their weight matrices are zero, and we discuss the implications of this for NAS.

architecture, classification, name change, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Subquadratic Overparameterization for Shallow Neural Networks

Neural Information Processing SystemsDec-24-2025, 04:36:18 GMT

Overparameterization refers to the important phenomenon where the width of a neural network is chosen such that learning algorithms can provably attain zero loss in nonconvex training. The existing theory establishes such global convergence using various initialization strategies, training modifications, and width scalings. In particular, the state-of-the-art results require the width to scale quadratically with the number of training data under standard initialization strategies used in practice for best generalization performance. In contrast, the most recent results obtain linear scaling either with requiring initializations that lead to the lazy-training, or training only a single layer. In this work, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling on the network width. We achieve the desiderata via Polyak-Lojasiewicz condition, smoothness, and standard assumptions on data, and use tools from random matrix theory.

initialization strategy, name change, subquadratic overparameterization, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.55)

Add feedback

Learning single-index models with shallow neural networks

Neural Information Processing SystemsDec-24-2025, 02:22:39 GMT

Single-index models are a class of functions given by an unknown univariate ``link'' function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recovering the relevant (one-dimensional) subspace, are well-understood, they rely on tailored algorithms that exploit the specific structure of the target function. In this work, we introduce a natural class of shallow neural networks and study its ability to learn single-index models via gradient flow. More precisely, we consider shallow networks in which biases of the neurons are frozen at random initialization. We show that the corresponding optimization landscape is benign, which in turn leads to generalization guarantees that match the near-optimal sample complexity of dedicated semi-parametric methods.

learning single-index model, name change, shallow neural network, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.61)

Add feedback

Filters

Collaborating Authors

shallow neural network

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

483101a6bc4e6c46a86222eb65fbcb6a-Paper.pdf

3fb6c52aeb11e09053c16eabee74dd7b-Supplemental-Conference.pdf

3fb6c52aeb11e09053c16eabee74dd7b-Paper-Conference.pdf

483101a6bc4e6c46a86222eb65fbcb6a-Paper.pdf

Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks

A Dynamical Central Limit Theorem for Shallow Neural Networks

Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

A Classification of G -invariant Shallow Neural Networks

Subquadratic Overparameterization for Shallow Neural Networks

Learning single-index models with shallow neural networks